Data Description: The data contains features extracted from the silhouettes of vehicles viewed at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
Domain: Object recognition
Context: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles
Objective: Apply a dimensionality reduction technique (PCA) and train a model using the principal components instead of the raw features
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
vehicle = pd.read_csv('vehicle.csv')
#Read vehicle data from csv file and display top 5 records
vehicle.head()
#Get number of rows and columns of the data records/structure of the file
vehicle.shape
There are 846 rows and 19 columns in total, including the target variable
#Get datatypes of each column - convert object data type(if any) to categorical data
vehicle.info()
The dataset has columns of integer and float types; only the target variable ('class') is of object type
# Checking if any missing value
vehicle.isnull().sum()
There are many null values present in the vehicle dataset, which need further processing. The columns with missing data are: circularity, distance_circularity, radius_ratio, pr.axis_aspect_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about.1 and skewness_about.2
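Rather than reading the affected columns off the `isnull().sum()` output, they can be listed programmatically. A minimal sketch on a toy frame (the column names and values here are illustrative, standing in for the vehicle DataFrame):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the vehicle DataFrame
df = pd.DataFrame({
    "circularity": [40.0, np.nan, 44.0],
    "radius_ratio": [160.0, 170.0, np.nan],
    "compactness": [90.0, 92.0, 95.0],
})

# Keep only the columns that contain at least one null, with their counts
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to `vehicle`, the same two lines reproduce the column list quoted above.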
#Checking the unique data
vehicle.nunique()
#Five point summary of attributes and label
#Transposing index and columns
vehicle.describe().T
scaled_variance.1 has a very high standard deviation; the standard deviations of radius_ratio, scatter_ratio, scaled_variance and scaled_radius_of_gyration are also on the higher side. skewness_about and skewness_about.1 have minimum values of 0.0, which is a legitimate value; these cannot be treated as blank values
# Lets see the value counts of the target varible - 'class'
vehicle['class'].value_counts()
There are 429 cars, 218 buses and 199 vans
## Pair plot that includes all the columns of the data frame
sns.pairplot(vehicle, hue="class")
plt.figure(figsize = (15,7))
plt.title('Correlation of Attributes', y=1.05, size=25)
sns.heatmap(vehicle.corr(), cmap='plasma',annot=True, fmt='.2f')
Inference drawn from Heatmap and Pairplot:
Our objective is to recognize whether an object is a van, a bus or a car from the input features, and a key assumption is that there is little or no multicollinearity between those features. If two features are highly correlated, there is little value in keeping both, and one of them can be dropped. The heatmap gives us the correlation matrix, from which we can see which features are highly correlated. The matrix shows many such features: scaled_variance.1 and scatter_ratio have a correlation of 1, and several other pairs exceed 0.9. We will therefore drop the columns involved in correlations of ±0.9 or above. There are 8 such columns: ->max.length_rectangularity ->scaled_radius_of_gyration ->skewness_about.2 ->scatter_ratio ->elongatedness ->pr.axis_rectangularity ->scaled_variance ->scaled_variance.1
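The highly correlated pairs can also be found programmatically by scanning the upper triangle of the correlation matrix, rather than by eye. A sketch with the same |r| ≥ 0.9 threshold, shown on toy data (the column names and values are illustrative):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "scatter_ratio": a,
    "scaled_variance.1": a * 2 + rng.normal(scale=0.01, size=200),  # near-perfect copy
    "compactness": rng.normal(size=200),                            # unrelated column
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for r in upper.index for c in upper.columns
         if pd.notna(upper.loc[r, c]) and upper.loc[r, c] >= 0.9]
print(pairs)
```

Running the same scan on `vehicle.corr()` would list the correlated pairs behind the 8 columns named above.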
Analysis of each column with the help of plots
# Python function to build a distribution plot and a box plot side by side
def buildPlot(columnName):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
    fig.set_size_inches(20, 4)
    sns.distplot(vehicle[columnName], ax=ax1)
    ax1.set_title("Distribution Plot")
    sns.boxplot(vehicle[columnName], ax=ax2)
    ax2.set_title("Box Plot")
columnName = 'compactness'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the compactness column and it looks normally distributed. There are also no missing values
columnName = 'circularity'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the circularity column and it looks normally distributed. Since there are missing values, we replace them with the median.
vehicle['circularity'] = vehicle['circularity'].fillna(vehicle['circularity'].median())
columnName = 'distance_circularity'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the distance_circularity column, but the distribution plot shows two peaks and left skewness, since the long tail is on the left side (mean < median). We replace missing values with the median
vehicle['distance_circularity'] = vehicle['distance_circularity'].fillna(vehicle['distance_circularity'].median())
columnName = 'radius_ratio'
buildPlot(columnName)
From the plots above we can see that there are outliers in the radius_ratio column and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['radius_ratio'] = vehicle['radius_ratio'].fillna(vehicle['radius_ratio'].mean())
columnName = 'pr.axis_aspect_ratio'
buildPlot(columnName)
From the plots above we can see that there are outliers in the pr.axis_aspect_ratio column and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['pr.axis_aspect_ratio'] = vehicle['pr.axis_aspect_ratio'].fillna(vehicle['pr.axis_aspect_ratio'].mean())
columnName = 'max.length_aspect_ratio'
buildPlot(columnName)
From the plots above we can see that there are outliers in the max.length_aspect_ratio column and right skewness, since the long tail is on the right side (mean > median). There are no missing values in this column
columnName = 'scatter_ratio'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the scatter_ratio column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['scatter_ratio'] = vehicle['scatter_ratio'].fillna(vehicle['scatter_ratio'].mean())
columnName = 'elongatedness'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the elongatedness column; the distribution plot shows two peaks and left skewness, since the long tail is on the left side (mean < median). As mean < median, we replace missing values with the median
vehicle['elongatedness'] = vehicle['elongatedness'].fillna(vehicle['elongatedness'].median())
columnName = 'pr.axis_rectangularity'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the pr.axis_rectangularity column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['pr.axis_rectangularity'] = vehicle['pr.axis_rectangularity'].fillna(vehicle['pr.axis_rectangularity'].mean())
columnName = 'max.length_rectangularity'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the max.length_rectangularity column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['max.length_rectangularity'] = vehicle['max.length_rectangularity'].fillna(vehicle['max.length_rectangularity'].mean())
columnName = 'scaled_variance'
buildPlot(columnName)
From the plots above we can see that there are outliers in the scaled_variance column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['scaled_variance'] = vehicle['scaled_variance'].fillna(vehicle['scaled_variance'].mean())
columnName = 'scaled_variance.1'
buildPlot(columnName)
From the plots above we can see that there are outliers in the scaled_variance.1 column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['scaled_variance.1'] = vehicle['scaled_variance.1'].fillna(vehicle['scaled_variance.1'].mean())
columnName = 'scaled_radius_of_gyration'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the scaled_radius_of_gyration column, but there is right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['scaled_radius_of_gyration'] = vehicle['scaled_radius_of_gyration'].fillna(vehicle['scaled_radius_of_gyration'].mean())
columnName = 'scaled_radius_of_gyration.1'
buildPlot(columnName)
From the plots above we can see that there are outliers in the scaled_radius_of_gyration.1 column and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['scaled_radius_of_gyration.1'] = vehicle['scaled_radius_of_gyration.1'].fillna(vehicle['scaled_radius_of_gyration.1'].mean())
columnName = 'skewness_about'
buildPlot(columnName)
From the plots above we can see that there are outliers in the skewness_about column and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['skewness_about'] = vehicle['skewness_about'].fillna(vehicle['skewness_about'].mean())
columnName = 'skewness_about.1'
buildPlot(columnName)
From the plots above we can see that there are outliers in the skewness_about.1 column and right skewness, since the long tail is on the right side (mean > median). As mean > median, we replace missing values with the mean
vehicle['skewness_about.1'] = vehicle['skewness_about.1'].fillna(vehicle['skewness_about.1'].mean())
columnName = 'skewness_about.2'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the skewness_about.2 column, but there is left skewness, since the long tail is on the left side (mean < median). As mean < median, we replace missing values with the median
vehicle['skewness_about.2'] = vehicle['skewness_about.2'].fillna(vehicle['skewness_about.2'].median())
columnName = 'hollows_ratio'
buildPlot(columnName)
From the plots above we can see that there are no outliers in the hollows_ratio column, but there is left skewness, since the long tail is on the left side (mean < median). There are no missing values in this column
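The per-column choices above follow one rule: fill with the mean when mean > median (right skew), otherwise with the median. That rule can be captured in a single helper; a minimal sketch, shown on a toy frame (the helper name and data are illustrative, not part of the notebook):

```python
import pandas as pd
import numpy as np

def impute_by_skew(df):
    """Fill numeric nulls with the mean when mean > median (right skew),
    otherwise with the median - mirroring the per-column choices above."""
    out = df.copy()
    for col in out.select_dtypes(include=np.number):
        if out[col].isnull().any():
            fill = out[col].mean() if out[col].mean() > out[col].median() else out[col].median()
            out[col] = out[col].fillna(fill)
    return out

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 100.0, np.nan]})  # right-skewed column
filled = impute_by_skew(toy)
print(filled["x"].iloc[-1])  # filled with the mean of the observed values
```

A caveat worth noting: with heavy outliers the mean is pulled toward the tail, so the median is often the more robust fill even for skewed columns; the sketch simply reproduces the rule used in this notebook.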
sns.countplot(vehicle['class'])
plt.show()
vehicle.isnull().sum()
#Dropping the target column
vehicle_df = vehicle
cols=['class']
X = vehicle_df.drop(cols,axis=1)
y = vehicle_df['class'] #Predicted Output column
#Scale the data before applying an algorithm
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
XScaled = scaler.fit_transform(X)
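StandardScaler performs the z-score transform, subtracting each column's mean and dividing by its (population) standard deviation. A quick numpy check of that equivalence on toy data:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# z-score: subtract the column mean, divide by the population std (ddof=0),
# which is what StandardScaler.fit_transform computes
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # ~0 per column
print(Z.std(axis=0))   # 1 per column
```

Scaling matters here because both SVM (distance-based margins) and PCA (variance-based components) would otherwise be dominated by the columns with the largest raw variances.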
#Splitting the data into a 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size=0.3, random_state=1)
#Checking % of data split
print("{0:0.2f}% data is in training set".format((len(X_train)/len(vehicle_df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(vehicle_df.index)) * 100))
With the data cleaned, scaled and split in the steps above, we can now train and evaluate SVM models.
SVM Algorithm Model
kernel='linear'
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
# Building a Support Vector Machine on train data
svc_model = SVC(kernel='linear',random_state=1)
svc_model.fit(X_train, y_train)
y_predict_svm_linear = svc_model.predict(X_test)
# check the accuracy on the training set
acc_SVM_linear_train = svc_model.score(X_train, y_train)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_linear_train))
acc_SVM_linear_test = svc_model.score(X_test, y_test)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_linear_test))
kernel='rbf'
# Building a Support Vector Machine on train data with kernel = 'rbf'
svc_model = SVC(kernel='rbf',random_state=1)
svc_model.fit(X_train, y_train)
y_predict_svm_rbf = svc_model.predict(X_test)
# check the accuracy on the training set
acc_SVM_rbf_train = svc_model.score(X_train, y_train)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_rbf_train))
acc_SVM_rbf_test = svc_model.score(X_test, y_test)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_rbf_test))
kernel='poly'
#Building a Support Vector Machine on train data(changing the kernel)
svc_model = SVC(kernel='poly',random_state=1)
svc_model.fit(X_train, y_train)
y_predict_svm_poly = svc_model.predict(X_test)
# check the accuracy on the training set
acc_SVM_poly_train = svc_model.score(X_train, y_train)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_poly_train))
acc_SVM_poly_test = svc_model.score(X_test, y_test)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_poly_test))
kernel='sigmoid'
#Building a Support Vector Machine on train data(changing the kernel)
svc_model = SVC(kernel='sigmoid',random_state=1)
svc_model.fit(X_train, y_train)
y_predict_svm_sigmoid = svc_model.predict(X_test)
# check the accuracy on the training set
acc_SVM_sigmoid_train = svc_model.score(X_train, y_train)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_sigmoid_train))
acc_SVM_sigmoid_test = svc_model.score(X_test, y_test)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_sigmoid_test))
Observation:
The test-set accuracies above show that SVM with kernel='linear' and kernel='rbf' give the best results.
We'll choose SVM with kernel='rbf' as the best model, since it gives the best results on both the training and test data. There is no overfitting or underfitting
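Accuracy alone can hide per-class errors (e.g. vans systematically mislabelled as cars). Since `confusion_matrix` is already imported above, it is worth inspecting; a sketch on toy labels (in the notebook one would pass `y_test` and `y_predict_svm_rbf` instead):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test / y_predict_svm_rbf
y_true = ["car", "car", "bus", "van", "bus", "van"]
y_pred = ["car", "bus", "bus", "van", "bus", "car"]

labels = ["bus", "car", "van"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class
```

The diagonal holds correct predictions per class; off-diagonal cells show which classes are being confused with which.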
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = SVC(kernel='rbf',random_state=1)
kfold_results_raw = cross_val_score(model, XScaled, y, cv=kfold)
print(kfold_results_raw)
print("Accuracy: %.3f%% (%.3f%%)" % (kfold_results_raw.mean()*100.0, kfold_results_raw.std()*100.0))
K-fold cross-validation with the SVM gives an accuracy of 96.456% with a standard deviation of ±4.085%
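One caveat with 50 folds on 846 rows: each test fold holds only ~17 samples, so class proportions can drift from fold to fold, inflating the standard deviation. StratifiedKFold keeps the proportions fixed; a sketch on toy data (the fold count and labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 2:1:1 class ratio
y = np.array(["car"] * 10 + ["bus"] * 5 + ["van"] * 5)
X = np.arange(20).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Every test fold preserves the 2:1:1 class ratio
    vals, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(dict(zip(vals, counts)))
print(fold_counts)
```

Passing `cv=skf` to `cross_val_score` would apply the same idea to the vehicle data.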
First, let's scale the original data using StandardScaler (z-score), which we have already done in the steps above and stored in XScaled
# Import the PCA and print the covariance matrix
from sklearn.decomposition import PCA
covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
# Apply PCA on all the 18 components of the vehicle dataset on the scaled data
pca = PCA(n_components=18,random_state=1)
pca.fit(XScaled)
# Print The eigen Values
print(pca.explained_variance_)
# Print the eigen Vectors
print(pca.components_)
# Print the percentage of variation explained by each eigen Vector
print(pca.explained_variance_ratio_)
# Plot the Eigen Value vs Variation explained bar chart
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
# Plot the eigen value vs cumulative variation explained step chart
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('eigen Value')
plt.show()
Dimensionality Reduction: From the above 2 charts, it is clearly evident that 8 dimensions is a reasonable choice. With 8 components we can explain over 95% of the variation in the original data
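scikit-learn can also pick this cut-off directly: passing a float in (0, 1) as `n_components` keeps the smallest number of components that explain at least that fraction of variance. A sketch on toy data (the synthetic 18-column matrix is illustrative; the 0.95 target mirrors the choice above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# 200 samples, 18 columns whose variance lives mostly in 5 latent directions
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 18)) + 0.05 * rng.normal(size=(200, 18))

pca = PCA(n_components=0.95, random_state=1)
pca.fit(X)
print(pca.n_components_)                    # number of components chosen automatically
print(pca.explained_variance_ratio_.sum())  # >= 0.95 by construction
```

On `XScaled`, `PCA(n_components=0.95)` should select the same 8 components read off the charts.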
# New PCA with 8 components - after dimensionality reduction
pca_new = PCA(n_components=8,random_state=1)
pca_new.fit(XScaled)
# Print the new eigen Vectors
print(pca_new.components_)
# Print the new eigen values
print(pca_new.explained_variance_ratio_)
Xpca_new = pca_new.transform(XScaled)
Xpca_new
# Pairplot
sns.pairplot(pd.DataFrame(Xpca_new))
Repeating Step 3: Split the data into train and test
X_trainNew, X_testNew, y_trainNew, y_testNew = train_test_split(Xpca_new, y, test_size=0.3, random_state=1)
#Checking % of data split
print("{0:0.2f}% data is in training set".format((len(X_trainNew)/len(vehicle_df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_testNew)/len(vehicle_df.index)) * 100))
Repeating Step 4 : Train a Support vector machine using the train set and get the accuracy on the test set
kernel='linear'
# Building a Support Vector Machine on train data with kernel = 'linear'
svc_model = SVC(kernel='linear',random_state=1)
svc_model.fit(X_trainNew, y_trainNew)
y_predict_svm_linear_pca = svc_model.predict(X_testNew)
y_predict_svm_linear_pca
# check the accuracy on the training and test set
acc_SVM_linear_train_pca = svc_model.score(X_trainNew, y_trainNew)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_linear_train_pca))
acc_SVM_linear_test_pca = svc_model.score(X_testNew, y_testNew)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_linear_test_pca))
kernel='rbf'
# Building a Support Vector Machine on train data with kernel='rbf'
svc_model = SVC(kernel='rbf',random_state=1)
svc_model.fit(X_trainNew, y_trainNew)
y_predict_svm_rbf_pca = svc_model.predict(X_testNew)
y_predict_svm_rbf_pca
# check the accuracy on the training and test set
acc_SVM_rbf_train_pca = svc_model.score(X_trainNew, y_trainNew)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_rbf_train_pca))
acc_SVM_rbf_test_pca = svc_model.score(X_testNew, y_testNew)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_rbf_test_pca))
kernel='poly'
# Building a Support Vector Machine on train data with kernel='poly'
svc_model = SVC(kernel='poly',random_state=1)
svc_model.fit(X_trainNew, y_trainNew)
y_predict_svm_poly_pca = svc_model.predict(X_testNew)
y_predict_svm_poly_pca
# check the accuracy on the training and test set
acc_SVM_poly_train_pca = svc_model.score(X_trainNew, y_trainNew)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_poly_train_pca))
acc_SVM_poly_test_pca = svc_model.score(X_testNew, y_testNew)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_poly_test_pca))
kernel='sigmoid'
# Building a Support Vector Machine on train data with kernel='sigmoid'
svc_model = SVC(kernel='sigmoid',random_state=1)
svc_model.fit(X_trainNew, y_trainNew)
y_predict_svm_sigmoid_pca = svc_model.predict(X_testNew)
y_predict_svm_sigmoid_pca
# check the accuracy on the training and test set
acc_SVM_sigmoid_train_pca = svc_model.score(X_trainNew, y_trainNew)
print("Accuracy of SVM on train set: {0:.4f}".format(acc_SVM_sigmoid_train_pca))
acc_SVM_sigmoid_test_pca = svc_model.score(X_testNew, y_testNew)
print("Accuracy of SVM on test set: {0:.4f}".format(acc_SVM_sigmoid_test_pca))
Observation:
The test-set accuracies above show that SVM with kernel='linear' and kernel='rbf' give the best results.
We'll choose SVM with kernel='rbf' as the best model, since it gives the best results on both the training and test data. There is no overfitting or underfitting
Repeating Step 5 : Perform K-fold cross validation and get the cross validation score of the model
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = SVC(kernel='rbf',random_state=1)
kfold_results_pca = cross_val_score(model, Xpca_new, y, cv=kfold)
print(kfold_results_pca)
print("Accuracy: %.3f%% (%.3f%%)" % (kfold_results_pca.mean()*100.0, kfold_results_pca.std()*100.0))
After dimensionality reduction with PCA, K-fold cross-validation using the Support Vector Machine gives an accuracy of 94.912% with a standard deviation of ±4.603%
resultsSvmDf = pd.DataFrame({'Method':['SVM-rbf-raw'], 'accuracy-train': acc_SVM_rbf_train, 'accuracy-test': acc_SVM_rbf_test})
resultsSvmDf = resultsSvmDf[['Method', 'accuracy-train', 'accuracy-test']]
tempResultsDf = pd.DataFrame({'Method':['SVM-rbf-pca'], 'accuracy-train': acc_SVM_rbf_train_pca, 'accuracy-test': acc_SVM_rbf_test_pca})
resultsSvmDf = pd.concat([resultsSvmDf, tempResultsDf])
resultsSvmDf
print("Accuracy of K-fold-cross-validation on raw data: %.3f%% (%.3f%%)" % (kfold_results_raw.mean()*100.0, kfold_results_raw.std()*100.0))
print("Accuracy of K-fold-cross-validation after pca: %.3f%% (%.3f%%)" % (kfold_results_pca.mean()*100.0, kfold_results_pca.std()*100.0))
Looks like after reducing the dimensionality to 8 components, the model performs nearly as well as on the raw data. This is much better than managing a large number of independent variables in n dimensions